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ABSTRACT 

This paper introduces a novel feature extraction technique for the analysis 
of spectral line Stokes profiles. The procedure is based on the use of an auto- 
associative artificial neural network containing non-linear hidden layers. The 
neural network extracts a small subset of parameters from the profiles (features), 
from which it is then able to reconstruct the original profile. This new approach 
is compared to two other procedures that have been proposed in previous works, 
namely principal component analysis and Hermitian function expansions. De- 
pending on the target application, each one of these three techniques has some 
advantages and disadvantages, which are discussed here. 

Subject headings: line: profiles - methods: data analysis - methods: numerical 
- Sun: atmosphere - stars: atmospheres 



1. Introduction 



A new breed of techniques for the diagnostics of solar magnetic fields has been emerg- 
ing during the last few years. The fundamental idea that inspired them is the recognition 
that the observable polarization (Stokes) line profiles exhibit conspicuous patterns of fea- 
tures associated with the underlying physical conditions in the atmosphere where they form. 
This has recently motivated a considerable interest in the field of pattern recognition. Po- 
tential applications to solar magnetic field diagnostics have been explored in a number of 
recent papers by Rees et al. (2000); Socas-Navarro et al. (2001); Carroll & Staude (2001); 



1 The National Center for Atmospheric Research (NCAR) is sponsored by the National Science Founda- 
tion. 



- 2 - 



Socas-Navarro & Sanchez Almeida (2002); Skumanich & Lopez Ariste (2002); Socas-Navarro 
(2003); del Toro Iniesta & Lopez Ariste (2003); Casini et al. (2003). 

Pattern recognition techniques were originally developed in the fields of artificial in- 
telligence and machine vision, but have found a vast number of applications over a broad 
spectrum of disciplines (see, e.g., the review by Rees & Guo 2003). Solar physicists have 
relied on least-squares fitting procedures to invert the observed spectra since the late 70s, 
although this only became a generalized practice in the early 90s (for more details see the 
reviews by Socas-Navarro 2001; del Toro Iniesta 2003). The least-squares approach is still as 
useful as ever and is not likely to be replaced by the newer pattern recognition in the near 
future. The reason is simply that one can obtain much more detailed information from a 
given set of profiles by using a least-squares inversion. It also offers more flexibility in the 
physical model used for the inversion. Research on new inversion techniques is intended to 
develop alternative algorithms that will complement, and not replace previous least-square 
procedures. 

Nevertheless, it is justified to devote significant efforts towards the development of alter- 
native inversion algorithms. Without going into much detail here (see, e.g., Socas-Navarro 
2002 for a more thorough discussion), it should be noted that the next generation spectro- 
polarimeters (both ground-based and space-borne) will deliver enormous dataflows that sim- 
ply cannot be analyzed with traditional inversion codes. The need for larger fields of view, 
higher spatial resolution and better time cadence is driving new instrumental developments. 
These, in turn, demand new analysis techniques that are fast, ideally operating in real-time, 
and robust, in the sense of not requiring any human supervision. Other important consider- 
ations that motivate research on pattern recognition are related to the need for establishing 
long-term databases of observations (properly inverted), or the desire to obtain real-time 
maps of the solar magnetic field at the observing site. 

As an intermediate step in this process, some authors have sought a suitable set of 
features in the Stokes profiles of spectral lines and an appropriate algorithm to extract 
such features. Feature extraction is of great potential interest, e.g. for data compression 
(which would help circumvent limited spacecraft telemetry), dimensionality reduction, profile 
classification, inversion pre-processing, etc. The most promising approaches developed so far 
seem to be the use of principal component analysis (PCA), introduced by Rees et al. (2000), 
and the expansion in Hermitian functions (EHF) of del Toro Iniesta & Lopez Ariste (2003). 
In this paper I present a new method which makes use of artificial neural networks (ANNs), 
and compare it to PCA and EHF in terms of performance and potential usefulness. 

The organization of the present paper is as follows. Section 2 below reviews the PCA 
and EHF methods and discusses some important ideas that are relevant for our purposes 
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here. Section 3 gives a detailed introduction to ANNs and puts forward the concept of auto- 
associative ANNs, which is the base for the feature extraction algorithm. The comparison 
among the three methods is presented in §4. Finally, the most important conclusions are 
discussed in §5. 



2. Profile expansion in basis functions 

Mathematically, the problem considered in this paper can be summarized as follows. 
Given a set of one or more Stokes profiles p k (\i) (where k = 1, ... ,4 denotes the Stokes 
parameter and Aj are the N\ wavelengths of the spectral samples), we wish to extract a 
suitable set of M parameters c (features) that contain as much information from the profiles 
as possible: 

c = H(p), (1) 

with H being a M-dimensional function that characterizes the feature-extraction technique. 
We can then reconstruct the original profile approximately by doing the inverse transforma- 
tion: 

p ~ p' = R(c) . (2) 

Three different strategies are explored in the next sections. The first two, PCA and 
EHF, are series expansions in terms of orthonormal basis functions. The third method, 
AANN, is somewhat more complex and allows for non-linear relations between the features 
and the profiles. 



2.1. Principal component analysis 

PCA has been used in a broad variety of fields (e.g., Rees & Guo 2003) and was recently 
introduced in the radiative transfer literature by Rees et al. (2000). At its core, PCA 
is simply a series expansion in terms of some orthonormal basis functions. If p(Aj) is a 
discretized spectral profile, we have: 

N 

p(A0 = X^Ai) , (3) 
j'=i 

where iV is the number of wavelength samples, b>(\i) are the basis functions (sometimes 
referred to as eigenprofiles or eigenvectors), and Cj are the coefficients of the expansion. 
These coefficients are sometimes referred to as features or eigenfeatures. 
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The basic premise that makes PCA so special is that the basis is cleverly chosen for 
each particular problem that one is dealing with. Instead of using a predefined set of basis 
functions, PCA seeks an optimum (in the sense defined below) set of eigenprofiles fr'(A). The 
first obvious condition that the b>(\) must meet in order to form a basis of the profile space 
is orthonormality: 

TV 

J2V(\)b k (\i) = S jk , (4) 
i=i 

where djk is the Kronecker delta. This condition implies that the expansion coefficients Cj 
can be determined simply by the following scalar product: 

TV 

Cj = 5>(A^(\). (5) 

i=i 

Equation (3) is exact because the sum extends to as many as N expansion terms. 
However, the whole point of feature extraction is precisely to reduce the dimensionality of 
the problem. This goal is achieved by truncating the expansion at a given order M (M < N), 
retaining only those terms that are deemed relevant. Consider a statistical sample of N p 
profiles, denoted by (with j = 1, . . . , N p ). The individual components of a vector pj are 
the observable spectral samples (p? = p*(\i)). Let us approximate these profiles by: 

M TV 

pi~^ c *bi+ ejW, (6) 

j=l j=M+l 

where e is a constant vector. When the expansion is truncated at order M, the total quadratic 
error accumulated over the entire sample is: 

N p N 
i=l j=M+l 

Notice that, so far, we have not particularized to PCA or given an explicit form for 
the bj. The discussion above applies to any expansion in basis functions. The distinctive 
feature of PCA is that it provides a basis that minimizes the truncation error for any order 
M. Minimization of Em with respect to straighforwardly leads to: 
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y 1=1 

The truncation error E M can also be minimized with respect to the basis functions b-i. One 
then obtains: 

Cb> = \W , (9) 
where C is the covariance matrix, defined as: 

C = ^(p i -p)(p i -pf. (10) 

1=1 

In Eq (10) above, p is the average profile over the statistical sample and T denotes the usual 
matrix transposition operation. 

Equation (9) shows that the PCA basis functions are simply the eigenvectors of the 
covariance matrix. Hence the denominations of eigenfunctions or eigenprofiles that they 
sometimes receive. The PCA expansion procedure can be summarized as follows. First, one 
picks a suitable database of profiles that are deemed representative of the physical problem 
being tackled. This database should ideally span a broad range of profiles, covering all 
the possible realizations that we might expect to find in the observations. The next step 
is to construct the covariance matrix C by using Eq (10), and solve the corresponding 
eigenvalue problem (Eq [9]). The eigenvectors of C are a basis of orthonormal profiles W 
(j — 1, . . . , N) with some interesting properties. In particular, it provides the optimum linear 
coordinate transformation for our problem (assuming that the original database is a good 
representation), in the sense that the information preserved upon truncation at any order 
M is maximum. 



2.2. Hermite functions 

The Hermite functions constitute an orthonormal basis of £ 2 , the space of integrable 
functions. They can be calculated as the product of a Gaussian by the Hermite polynomials 
(with some normalization constants): 

H n (x) = J^ eM-x 2 )n n (x) . (11) 
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The Hermite polynomials that appear in Eq (11) are given by the following recursive 
relations: 



n° = i 
n 1 = 2x 

H n = 2(n-l)xH n _ 1 -2H n _ 2 . (12) 

The H n (x) defined by in Eqs (11) and (12) are orthonormal. As with PCA above, any 
given profile may be expanded as linear combination of this basis: 

N 

p(x) ^ //M.r). (13) 
j'=i 

Obviously, the optimal expansion is attained when x is the wavelength relative to line center, 
and normalized to the width of the line. 



3. Artificial Neural Networks 

Before going into details on the particular physical problem of spectral line analysis, a 
general introduction to ANNs is probably in order. An ANN is an abstract structure that 
mimics to some extent the functioning of a biological neural system. Consider a number 
of memory cells (neurons), each one having the ability to store a datum. The neurons are 
inter-connected (synaptic connections) in such a way that the number they store can be 
modified according to the values stored in other neighboring neurons that is is connected to 
(see Fig 1). 

ANN investigations usually employ some simplifying restrictions on the structure of 
neurons and synaptic connections. One of them is to consider forward-propagating networks, 
in which a signal propagation direction is defined. The synaptic connections seen by a neuron 
can be inward- or outward-directed, depending on whether it is used to receive or send a 
signal from or to a neighbor. Some neurons are taken as inputs in which the signal originates 
and have no inward connections. The neurons connected to the inputs are then modified 
according to the values in the input neurons. Such modifications in turn trigger a new set 
of neurons to alter their values, and the process continues until the output neurons (which 
have no further outwards connections) are reached. Forward-propagating networks are not 
allowed to have loopback connections. 
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In order to fix ideas let us consider a special type of forward-propagating ANN, namely 
the multi-layer perceptron (MLP). This is perhaps the most widely used ANN geometry. A 
MLP is arranged in successive layers, as sketched in Fig 1. Every neuron in a given layer is 
connected with all neurons in the previous layer. No conexions are allowed between neurons 
in the same layer, or between layers that are not successive. 
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Fig. 1. — Schematic representation of a multi-layer perceptron. 
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Let us discuss in some more detail the actual process of signal propagation. We start 
with a set of inputs Xi (with i = 1, . . . , N). Denoting by the datum stored in neuron n 
belonging to layer and defining / = to be the input layer, we have that 

Y? = Xi . (14) 
The basic equation for signal propagation in a MLP is: 

Y n = f(jt<j Y t 1 +^ > ( 15 ) 

with / = 1, . . . , Ni a y ers and n = 1, . . . , N. The W^j and f3 l n are the synaptic weights and 
biases, respectively They are parameters of the network that define its behavior. The 
function / is the so-called activation function. A MLP is said to be linear when f(x) = x. 
In this case the value in a given neuron is simply a linear combination of the data in its 
neighbors. For many interesting applications, however, non-linear activation functions are 
chosen. One of the most widely employed functions is the hyperbolic tangent, which shall 
be used in this paper as well. 

From a mathematical point of view, a forward-propagating ANN may be viewed as a 
multi-dimensional non-linear function F that transforms a vector of inputs Xi (with i = 
1, . . . , N) onto a vector of outputs Oj (with j — 1, . . . , M). The input and output spaces are 
not necessarily of the same dimension (N ^ M). Formally: 



o = F(x) . (16) 

The idea is to use F as an approximation to the physical model under consideration. 
What is distinctive about ANNs is their ability to "learn" . This means that the actual form 
of F is altered during a training process, making it a better approximation. 

The usual procedure to train a MLP is by generating a set of training data, consisting of 
known inputs x\ and outputs (sometimes referred to as targets) o\. The training inputs are 
propagated through the network, producing some outputs Oj, which in general differ from 
the targets by an error 5{ = Oj — o*. One then modifies the network parameters W-j and (3\ to 
minimize the sum of quadratic errors. Several algorithms exist in the literature to perform 
this task. Perhaps the most widely used is the back-propagation algorithm (Rumelhart et al. 
1986), which has been employed for the calculations in the present paper. Computer routines 
in Fortran 90 to create and train MLPs are available from the author upon request. 
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3.1. Auto-associative networks 

Dimensionality reduction (and feature extraction) can be achieved by using auto- associative 
ANNs (hereafter AANNs). An AANN is trained with targets that are equal to the inputs of 
the training set (o* = xf). This would not be very useful without another distinctive feature 
of AANNs, namely a reduced number of neurons in one (or more) of the hidden layers (the 
"bottleneck" layer). The resulting structure for a simple AANN is shown in Fig 2. 

When the AANN in Fig 2 is trained with o\ — x\, it needs to find a way to compress 
the input data into a smaller subset of numbers (in the particular example of the figure, 
going from 5 to 3) from which the input profile can be later reconstructed. When the AANN 
has only one hidden layer (as the one in the figure) and the activation function is linear, 
it can be shown that the AANN training has one global solution (Bourlard & Kamp 1988; 
Baldi & Hornik 1989). This solution is precisely the same that one wold obtain with a PCA 
decomposition. In other words, a properly trained linear AANN with one hidden layer is able 
to extract PCA coefficients from an input profile and then reconstruct the original profile 
from these coefficients. 
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Fig. 2. — Schematic representation of an auto-associative network. This particular network 
has one hidden layer with 3 neurons. The input and output layers have 5 neurons. 
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Using AANNs to do PCA decomposition is not very useful. There are simpler methods 
that are guaranteed to work and do not require a cumbersome, time-consuming training 
process. However, there may be instances in which it is convenient to use multi-layer non- 
linear AANNs, which are effectively some form of non-linear generalization of PCA. Extreme 
examples are the academic problems represented in Fig 3. 
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Fig. 3. — Two simple problems that illustrate the advantages of non-linear coordinate trans- 
formations. 
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In the (x, y) coordinate system, one needs two values to entirely determine the position 
of the points in the figure. A PCA analysis of the dataset would result in another orthogonal 
reference system (p, v\ which is simply a rotation of (x, y) thus requiring also two numbers 
for each datapoint. Obviously, the optimal coordinate choice for these problems would be a 
system that has one coordinate rj along the curve and another p perpendicular to it. In this 
new sytem, specifying 77 alone practially determines the location of the datapoints. 

We can say in this case that the "intrinsic dimensionality" of the problem is D^m = 1, 
which is smaller than that of the "measurement" space D meas = 2. A similar reasoning 
may be applied to the analysis of spectral Stokes profiles. Typical observations of a line 
profile consist of ~100 spectral samples times 4 Stokes parameters. However, the intrinsic 
dimensionality of the physical problem is usually much smaller. For instance, a simple 
Milne-Eddington model is specified by ~10 atmospheric parameters. 

The mapping from (x, y) to (rj, p) in the examples above is non-linear and therefore 
cannot be picked up by the PCA decomposition. However, a properly trained AANN may 
be able to perform this mapping. This is an interesting property with potential interest for 
spectral line diagnostics. 

4. Comparison 

In this section I present some results from application of the three methods (PCA, EHF 
and AANN) to the analysis of simulated solar profiles. A word of caution is in order, since 
these results depend on the particular conditions of the problem under study, such as the 
set of profiles used for the simulation, the model atmosphere considered, etc. Therefore, the 
performances reported here are to be taken as merely orientative. 

For a sake of computational simplicity, the simulations have been carried out with Milne- 
Eddington models that have fixed thermodynamical parameters, corresponding to typical 
solar conditions. The magnetic field vector varies randomly, with the field strength ranging 
between and 4000 G, and both the inclination and azimuth between and 180 degrees. 2 
The spectral lines simulated are the well-known pair of Fe I lines at 6302 A. Random noise 
at the level of 10~ 3 was added to all profiles. 

Several sets of 5000 profiles were computed from random models. One of such sets was 
used to calculate the PCA bj eigenprofiles (see the discussion preceding Eq [9]). The rest were 



2 Notice that it is not necessary to simulate azimuths greater than 180 degrees because of the well-known 
ambiguity of the Zecman-induccd profiles. 
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used to train an AANN with a total of 4 non-linear layers. The spectral ranges corresponding 
to the two telluric lines in this region have been removed from the input profiles, but the 
network is still required to provide the full profile on output. The "bottleneck" layer of the 
AANN contains 10 neurons, whereas the input and output layers have 320 each. The reason 
for chosing 10 neurons in the bottleneck layer is that this is the intrinsic dimensionality of 
the profiles (given here by the number of free parameters in the Milne-Eddington model 
atmosphere) . 

After a given training batch has been processed the ANN is presented with a set of 
500 test profiles (not included in the training) to measure its performance. The training 
process was stopped when two successive batches yield no significant improvement in this 
test. The total CPU time for the training process was over 30 hours running on a 2.4 GHz 
Intel Pentium 4 processor. Fig 4 shows the final reconstruction accuracy of the AANN for a 
sample set of Stokes profiles. 
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Fig. 4. — Simulated Stokes profiles (solid) normalized to the continuum intensity (J c ), and 
fit from the AANN (dashed). 
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A reference sample of 500 spectra was analyzed and reconstructed using PCA, the 
EHF and the AANN. For each spectrum, the difference between the simulated and the 
reconstructed profiles is quantified as: 



where N^ata is the number of spectral samples multiplied by the number of Stokes parameters 
considered, I sim is the reference profile, I rec is the reconstruction and <7j is the measurement 
noise. Using this definition, the error in the reconstruction shown in Fig 4 is x 2 = 10.27. 

Figure 5 shows the x 2 , averaged over the reference sample, as a function of the coefficients 
used in the PCA and EHF. In the case of the AANN the number of parameters is fixed and 
determined by the structure of the network. The vertical line marks the number of neurons 
in the bottleneck layer. 

Not surprisingly, the EHF exhibits the lowest accuracy. However, this method still 
has merit as it is the only one that has analytical basis functions. Moreover, it does not 
require any "a priori" computations, unlike the AANN (which needs training) or PCA (which 
needs a database). The AANN works well when high accuracy is not required and the 
model atmospheres are relatively simple (see the reconstruction in Fig 4). Unfortunately, 
the computational expense increases rapidly for more complex models or higher accuracy 
demands, rendering this technique impractical for some problems. Both PCA and the EHF 
ultimately converge to zero error when a sufficiently large number of expansion coefficients 
is employed. This is probably a desirable feature for some applications, depending on the 
reconstruction error tolerance. Once it has been trained, the AANN forward-propagation 
is very fast because it only requires the calculation of for each neuron using Eq (15). 
Notice that we only need to propagate the signal from the inputs to the bottleneck layer 
(in the case of feature extraction), or from the bottleneck to the outputs (in the case of 
profile reconstruction). In terms of computational requirements, none of the three techniques 
considered here presents a clear advantage over the others. 



The development of new instrumentations and the need to process ever increasing 
dataflows is turning the attention of solar physicists to pattern recognition techniques (see 
references in §1). The present work can be considered a follow up to the papers by Rees et al. 
(2000) and del Toro Iniesta & Lopez Ariste (2003), introducing a new technique for feature 
extraction and comparing it to the proposed PCA and EHF proposed in those papers. 




Ndat, 
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5. 



Conclusions 




Fig. 5. — Average x 2 f° r a reference sample of 500 random profiles. Solid: PCA. Dashed: 
EHF. The horizontal and vertical lines mark the reconstruction error of the AANN and the 
number of neurons in the bottleneck layer, respectively. 
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The AANN expansion is based on a MLP which is trained to reproduce the profile 
supplied in the input neurons. However, one (or more) of the MLP hidden layers has fewer 
neurons than the inputs/outputs layers. Upon successful training, the AANN is able to 
extract a small set of features from a profile. These features are numbers stored in the 
bottleneck layer, from which the entire profile can be reconstructed. This works effectively 
as a decomposition/expansion scheme in which the number of coefficients is set by the number 
of neurons in the bottleneck layer. 

The comparison among the three feature extraction techniques does not reveal a "win- 
ner". Instead, each method has its own advantages and disadvantages. Depending on the 
particular problem it may be more interesting to use one or another. 

Ultimately, the goal is to use the expansion coefficients from whichever method one 
might choose for the implementation of some inversion procedure. The most straightforward 
way to do this would be a look-up table, as Socas-Navarro et al. (2001) did for PCA. Other 
alternatives include neural networks or support vector machines, which could benefit from 
the dimensionality reduction offered by these techniques as a pre-processing layer. 
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